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Oj, Sparse model estimation is a topic of high importance in mod- 

ern data analysis due to the increasing availability of data sets with 
a large number of variables. Another common problem in applied 

C~~- , statistics is the presence of outliers in the data. This paper combines 

robust regression and sparse model estimation. A robust and sparse 
estimator is introduced by adding an Li penalty on the coefficient 

^ . estimates to the well-known least trimmed squares (LTS) estimator. 

The breakdown point of this sparse LTS estimator is derived, and a 
fast algorithm for its computation is proposed. In addition, the sparse 
LTS is applied to protein and gene expression data of the NCL60 can- 
cer cell panel. Both a simulation study and the real data application 
'^, , show that the sparse LTS has better prediction performance than its 

competitors in the presence of leverage points. 
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-^ I 1. Introduction. In applied data analysis, there is an increasing avail- 

[--. ' ability of data sets containing a large number of variables. Linear models 

C^^ . that include the full set of explanatory variables often have poor prediction 

^'. I performance as they tend to have large variance. Furthermore, large models 

^T . are in general difficult to interpret. In many cases, the number of variables 

^-^ ' is even larger than the number of observations. Traditional methods such 

as least squares can then no longer be applied due to the rank deficiency of 
the design matrix. For instance, gene expression or fMRI studies typically 
contain tens of thousands of variables for only a small number of observa- 
tions. In this paper, we present an application to the cancer cell panel of 
C^ , the National Cancer Institute, in which the data consists of 59 observations 

and 22,283 predictors. 

To improve prediction accuracy and as a remedy to computational prob- 
lems with high-dimensional data, a penalty term on the regression coeffi- 
cients can be added to the objective function. This approach shrinks the 
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coefficients and reduces variance at the price of increased bias. Tibshirani 
(1996) introduced the least absolute shrinkage and selection operator (lasso), 
in which the penalty function is the Li norm. Let y = (yi, . . . ,2/n)' be the 
response and X = ixij)i<i<n,i<j<p the matrix of predictor variables, where 
n denotes the number of observations and p the number of variables. In 
addition, let xi, . . . ,x„ be the p-dimensional observations, that is, the rows 
of X. We assume a standard regression model 

(1-1) yi = x'^(3 + ei, 

where the regression parameter is (3 = (/3i, . . . ,/3p)', and the error terms Si 
have zero expected value. With a penalty parameter A, the lasso estimate 
of /3 is 

n p 

(1.2) ^i^sso = argmin^(yi - x^/3)^ + nX^\(3j\. 

^ i=i j=i 

The lasso is frequently used in practice since the Li penalty allows to shrink 
some coefficients to exactly zero, that is, to produce sparse model estimates 
that are highly interpretable. In addition, a fast algorithm for computing the 
lasso is available through the framework of least angle regression [LARS; 
Efron et al. (2004)]. Other algorithms are available as well [e.g., Wu and 
Lange (2008)]. Due to the popularity of the lasso, its theoretical properties 
are well studied in the literature [e.g., Knight and Fu (2000), Zhao and Yu 
(2006), Zou, Hastie and Tibshirani (2007)] and several modifications have 
been proposed [e.g., Zou (2006), Yuan and Lin (2006), Gertheiss and Tutz 
(2010), Radchenko and James (2011), Wang et al. (2011)]. However, the lasso 
is not robust to outliers. In this paper we formally show that the breakdown 
point of the lasso is 1/n, that is, only one single outlier can make the lasso 
estimate completely unreliable. Therefore, robust alternatives are needed. 

Outliers are observations that deviate from the model assumptions and are 
a common problem in the practice of data analysis. For example, for many of 
the 22,283 predictors in the NCI data set used in Section 7, (log-transformed) 
responses on the 59 cell lines showed outliers. Robust alternatives to the 
least squares regression estimator are well known and studied; see Maronna, 
Martin and Yohai (2006) for an overview. In this paper, we focus on the least 
trimmed squares (LTS) estimator introduced by Rousseeuw (1984). This 
estimator has a simple definition, is quite fast to compute, and is probably 
the most popular robust regression estimator. Denote the vector of squared 
residuals by r^(/3) = {rf, . . . ,r^)' with rf = {yi — x^/3)^, i = 1, . . . ,n. Then 
the LTS estimator is defined as 

h 
(1-3) ^LTS = argminJ](r2(/3)).„, 
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where (r^(/3))i:„, < ••• < (r^(/3))„:„ are the order statistics of the squared 
residuals and h <n. Thus, LTS regression corresponds to finding the sub- 
set of h observations whose least squares fit produces the smallest sum of 
squared residuals. The subset size h can be seen as an initial guess of the 
number of good observations in the data. While the LTS is highly robust, it 
clearly does not produce sparse model estimates. Furthermore, if h <p, the 
LTS estimator cannot be computed. A sparse and regularized version of the 
LTS is obtained by adding an Li penalty with penalty parameter A to (1.3), 
leading to the sparse LTS estimator 

h p 

(1-4) ^sparseLTS = argmin^(r2(/3)) .^„ + hxj^l^jl- 

We prove in this paper that sparse LTS has a high breakdown point. It is 
resistant to multiple regression outliers, including leverage points. Besides 
being highly robust, and similar to the lasso estimate, sparse LTS (i) im- 
proves the prediction performance through variance reduction if the sample 
size is small relative to the dimension, (ii) ensures higher interpretability 
due to simultaneous model selection, and (iii) avoids computational prob- 
lems of traditional robust regression methods in the case of high-dimensional 
data. For the NCI data, sparse LTS was less influenced by the outliers than 
competitor methods and showed better prediction performance, while the 
resulting model is small enough to be easily interpreted (see Section 7). 

The sparse LTS (1.4) can also be interpreted as a trimmed version of 
the lasso, since the limit case h = n yields the lasso solution. Other robust 
versions of the lasso have been considered in the literature. Most of them 
are penalized M-estimators, as in van de Geer (2008) and Li, Peng and 
Zhu (2011). Rosset and Zhu (2004) proposed a Huber-type loss function, 
which requires knowledge of the residual scale. A least absolute deviations 
(LAD) type of estimator called LAD-lasso is proposed by Wang, Li and 
Jiang (2007), 

n p 

(1-5) ^LAD-lasso = argmin^|yj - x'^/J] + nA J]] |/3j|. 

^ i=i j=i 

However, none of these methods is robust with respect to leverage points, 
that is, outliers in the predictor space, and can handle outliers only in the 
response variable. The main competitor of the sparse LTS is robust least an- 
gle regression, called RLARS, and proposed in Khan, Van Aelst and Zamar 
(2007). They develop a robust version of the LARS algorithm, essentially 
replacing correlations by a robust type of correlation, to sequence and select 
the most important predictor variables. Then a nonsparse robust regression 
estimator is applied to the selected predictor variables. RLARS, as will be 
confirmed by our simulation study, is robust with respect to leverage points. 
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A main drawback of the RLARS algorithm of Khan, Van Aelst and Zamar 
(2007) is the lack of a natural definition, since it is not optimizing a clearly 
defined objective function. 

An entirely different approach is taken by She and Owen (2011), who 
propose an iterative procedure for outlier detection. Their method is based 
on imposing a sparsity criterion on the estimator of the mean-shift parameter 
7 in the extended regression model 

(1.6) y = X/3 + 7 + £. 

They stress that this method requires a nonconvex sparsity criterion. An ex- 
tension of the method to high-dimensional data is obtained by also assuming 
sparsity of the coefficients /3. Nevertheless, their paper mainly focuses on 
outlier detection and much less on sparse robust estimation. Note that an- 
other procedure for simultaneous outlier identification and variable selection 
based on the mean-shift model is proposed by Menjoge and Welsch (2010). 
The rest of the paper is organized as follows. In Section 2 the breakdown 
point of the sparse LTS estimator is obtained. Further, we also show that 
the lasso and the LAD-lasso have a breakdown point of only 1/n. A detailed 
description of the proposed algorithm to compute the sparse LTS regression 
estimator is provided in Section 3. Section 4 introduces a reweighted version 
of the estimator in order to increase statistical efficiency. The choice of the 
penalty parameter A is discussed in Section 5. Simulation studies are per- 
formed in Section 6. In addition, Section 7 presents an application to protein 
and gene expression data of the well-known cancer cell panel of the National 
Cancer Institute. The results indicate that these data contain outliers such 
that robust methods are necessary for analysis. Moreover, sparse LTS yields 
a model that is easy to interpret and has excellent prediction performance. 
Finally, Section 8 presents some computation times and Section 9 concludes. 

2. Breakdown point. The most popular measure for the robustness of 
an estimator is the replacement finite-sample breakdown point (FBP) [e.g., 
Maronna, Martin and Yohai (2006)]. Let Z = (X,y) denote the sample. For 
a regression estimator (3, the breakdown point is defined as 

(2.1) e*(/3;Z)=min|-:sup||/3(Z)||2 = oo|, 

where Z are corrupted data obtained from Z by replacing m of the original 
n data points by arbitrary values. We obtained the following result, from 
which the breakdown point of the sparse LTS estimator immediately follows. 
The proof is in the Appendix. 

Theorem 1. Let p{x) be a convex and symmetric loss function with 
/j(0) = and p{x) > for x / 0, and define p(x) := {p{xi), . . . ,p{xn))'- With 
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subset size h <n, consider the regression estimator 

h p 

(2.2) P = ^TgmmY,{p{y-^(3)),..n + hX^m, 

^ i=i j=i 

where (p(y — X/3)))i;„ < • • • < (p(y — X/3))„:„ are the order statistics of the 
regression loss. Then the breakdown point of the estimator /3 is given by 

n-h+1 



£*(/3;Z) 



n 



The breakdown point is the same for any loss function p fulfilhng the as- 
sumptions. In particular, the breakdown point for the sparse LTS estimator 
/^sparseLTS with subset size h<n, in which p(x) = x^, is still {n— h+ l)/n. 
The smaller the value of /i, the higher the breakdown point. By taking h 
small enough, it is even possible to have a breakdown point larger than 50%. 
However, while this is mathematically possible, we are not advising to use 
h < n/2 since robust statistics aim for models that fit the majority of the 
data. Thus, we do not envisage to have such large breakdown points. Instead, 
we suggest to take a value of h equal to a fraction a of the sample size, with 
a = 0.75, such that the final estimate is based on a sufficiently large num- 
ber of observations. This guarantees a sufficiently high statistical efficiency, 
as will be shown in the simulations in Section 6. The resulting breakdown 
point is then about 1 — a = 25%. Notice that the breakdown point does not 
depend on the dimension p. Even if the number of predictor variables is 
larger than the sample size, a high breakdown point is guaranteed. For the 
nonsparse LTS, the breakdown point does depend on p [see Rousseeuw and 
Leroy (2003)]. 

Applying Theorem 1 to the lasso [corresponding to p{x) = x^ and h = n] 
yields a finite-sample breakdown point of 

£ (AassojZ) = — . 

n 
Hence, only one outlier can already send the lasso solution to infinity, despite 
the fact that large values of the regression estimate are penalized in the 
objective function of the lasso. The nonrobustness of the Lasso comes from 
the use of the squared residuals in the objective function (1.2). Using other 
convex loss functions, as done in the LAD-lasso or penalized M-estimators, 
does not solve the problem and results in a breakdown point of 1/n as well. 
The theoretical results on robustness are also refiected in the application to 
the NCI data in Section 7, where the lasso is much more infiuenced by the 
outliers than the sparse LTS. 

3. Algorithm. We first present an equivalent formulation of the sparse 
LTS estimator (1.4). For a fixed penalty parameter A, define the objective 
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function 

p 

(3.1) QiH,P) = ^iy,-x'^fif + hXY,m, 

ieH j=i 

which is the Li penaHzed residual sum of squares based on a subsample 
HC{l,...,n} with \H\ = h. With 

(3.2) ^^ = argminQ(/7,/3), 

the sparse LTS estimator is given by 0^^^ ^ , where 

(3.3) Hopt= argmin Q{H,0fj). 

HC{l,...,n}:\H\=h 

Hence, the sparse LTS corresponds to finding the subset oi h<n observa- 
tions whose lasso fit produces the smallest penalized residual sum of squares. 
To find this optimal subset, we use an analogue of the FAST-LTS algorithm 
developed by Rousseeuw and Van Driessen (2006). 

The algorithm is based on concentration steps or C-steps. The C-step 
at iteration k consists of computing the lasso solution based on the cur- 
rent subset Hk, with \Hk\ = h, and constructing the next subset H^+i from 
the observations corresponding to the h smallest squared residuals. Let Hj. 
denote a certain subsample derived at iteration k and let f3jj be the coeffi- 
cients of the corresponding lasso fit. After computing the squared residuals 

^l = ('^i,!' ■ ■ ■ '"^m)' ^^^^ "^k.i = iVi ~ ^i/^Hfe)^' *h^ subsample F^+i for iter- 
ation k + 1 \s defined as the set of indices corresponding to the h smallest 
squared residuals. In mathematical terms, this can be written as 

Hk+i = {i G {1, . . . , n} : r| ^ G {{rl)j,n-j = 1, • ■ • , M}> 

where (r|)i:n < • • • < {^\)n:n denote the order statistics of the squared resid- 
uals. Let Phi. denote coefficients of the lasso fit based on Hk+i- Then 

(3.4) Q{Hk+i,0H,+-J < Q{Hk+i,0H,) < Q{Hk,0Hj, 

where the first inequality follows from the definition of f3fj , and the sec- 
ond inequality from the definition of H^. From (3.4) it follows that a C-step 
results in a decrease of the sparse LTS objective function, and that a se- 
quence of C-steps yields convergence to a local minimum in a finite number 
of steps. 

To increase the chances of arriving at the global minimum, a sufficiently 
large number s of initial subsamples Hq should be used, each of them being 
used as starting point for a sequence of C-steps. Rather than randomly 
selecting h data points, any initial subset Hq of size h is constructed from 
an elemental subset of size 3 as follows. Draw three observations from the 
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data at random, say, Xj^ , Xj2 and Xjg . The lasso fit for tliis elemental subset 
is then 

(3.5) P{ii,i2,i3} = argming({ii,i2,i3},/3), 

and the initial subset Hq is then given by the indices of the h observations 
with the smallest squared residuals with respect to the fit in (3.5). The 
nonsparse FAST-LTS algorithm uses elemental subsets of size p, since any 
OLS regression requires at least as many observations as the dimension p. 
This would make the algorithm not applicable if p > n. Fortunately the lasso 
is already properly defined for samples of size 3, even for large values of p. 
Moreover, from a robustness point of view, using only three observations is 
optimal, as it ensures the highest probability of not including outliers in the 
elemental set. It is important to note that the elemental subsets of size 3 are 
only used to construct the initial subsets of size h for the C-step algorithms. 
All C-steps are performed on subsets of size h. 

In this paper, we used s = 500 initial subsets. Using a larger number of 
subsets did not lead to better prediction performance in the case of the NCI 
data. Following the strategy advised in Rousseeuw and Van Driessen (2006), 
we perform only two C-steps for all s subsets and retain the si = 10 subsam- 
ples with the lowest values of the objective function (3.1). For the reduced 
number of subsets si, further C-steps are performed until convergence. This 
is a standard strategy for C-step algorithms to decrease computation time. 

Estimation of an intercept: the regression model in (1.1) does not con- 
tain an intercept. It is indeed common to assume that the variables are 
mean-centered and the predictor variables are standardized before apply- 
ing the lasso. However, computing the means and standard deviations over 
all observations does not result in a robust method, so we take a different 
approach. Each time the sparse LTS algorithm computes a lasso fit on a 
subsample of size h, the variables are first centered and the predictors are 
standardized using the means and standard deviations computed from the 
respective subsample. The resulting procedure then minimizes (1.4) with 
squared residuals rf = {yi — /3o — ^iP)'^, where /3o stands for the intercept. 
We verified that adding an intercept to the model has no impact on the 
breakdown point of the sparse LTS estimator of (3. 

4. Reweighted sparse LTS estimator. Let a denote the proportion of 
observations from the full sample to be retained in each subsample, that 
is, h = [(n + l)aj . In this paper we take a = 0.75. Then (1 — a) may be 
interpreted as an initial guess of the proportion of outliers in the data. This 
initial guess is typically rather conservative to ensure that outliers do not 
impact the results, and may therefore result in a loss of statistical efficiency. 
To increase efficiency, a reweighting step that downweights outliers detected 
by the sparse LTS estimator can be performed. 
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Under the normal error model, observations with standardized residuals 
larger than a certain quantile of the standard normal distribution may be 
declared as outliers. Since the sparse LTS estimator — like the lasso — is bi- 
ased, we need to center the residuals. A natural estimate for the center of 
the residuals is 

(4.1) /iraw = -^ ^ ri, 

where ri = yi — x^/^gp^i-scLTS ^^^ -ffopt is the optimal subset from (3.3). Then 
the residual scale estimate associated to the raw sparse LTS estimator is 
given by 



\^4.zj <7raw — "'a 



\ 



h 



1=1 



with squared centered residuals r^ = ((^i — Araw)^, • • • , {jn — Araw)^)', and 

/I ;.*-i{(a+l)/2) X-l/2 

(4.3) ka={- u^d^iu) 

\a J_$-1((q,_,.i)/2) 

a factor to ensure that iTraw is a consistent estimate of the standard deviation 
at the normal model. This formulation allows to define binary weights 

f4 4) u;=P' if|(^i-Araw)/<Traw|<^-ni-'5), . ^ -^ ^ 

^'' ' lO, if |(ri-Araw)/<Traw|>^-ni-'5), '"''■ 

In this paper 6 = 0.0125 is used such that 2.5% of the observations are ex- 
pected to be flagged as outliers in the normal model, which is a typical choice. 
The reweighted sparse LTS estimator is given by the weighted lasso fit 

n p 

(4.5) /3rcwcightcd = argmin^u;i(yi - x'^pf + Xn^^\Pj\, 

^ i=i j=i 

with n^j = X^iLi 'Wi the sum of weights. With the choice of weights given in 
(4.4), the reweighted sparse LTS is the lasso fit based on the observations not 
flagged as outliers. Of course, other weighting schemes could be considered. 
Using the residual center estimate 

1 '^ 

(,4.DJ /^reweighted = / ^ Wi[yi — XjPj.j,^j,jgjj^-jj(jj, 

4 = 1 

the residual scale estimate of the reweighted sparse LTS estimator is given by 

{^- 1 ) "^reweighted "-Ou 



A 



/ j '^i [Vi -^iPreweightcd A'rcwcightcd j ) 
1=1 



where ka^ is the consistency factor from (4.3) with a^ = n^/n. 
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Note that this reweighting step is conceptually different from the adaptive 
lasso by Zou (2006). While the adaptive lasso derives individual penalties on 
the predictors from initial coefficient estimates, the reweighted sparse LTS 
aims to include all nonoutlying observations into fitting the model. 

5. Choice of the penalty parameter. In practical data analysis, a suitable 
value of the penalty parameter A is not known in advance. We propose 
to select A by optimizing the Bayes Information Criterion (BIC), or the 
estimated prediction performance via cross-validation. In this paper we use 
the BIC since it requires less computational effort. The BIC of a given model 
estimated with shrinkage parameter A is given by 

(5.1) BIC(A) = log(a) + rf/(A)i^^, 

n 

where a denotes the corresponding residual scale estimate, (4.2) or (4.7), 
and df{\) are the degrees of freedom of the model. The degrees of freedom 
are given by the number of nonzero estimated parameters in (3 [see Zou, 
Hastie and Tibshirani (2007)]. 

As an alternative to the BIC, cross-validation can be used. To prevent out- 
liers from affecting the choice of A, a robust prediction loss function should 
be used. A natural choice is the root trimmed mean squared prediction error 
(RTMSPE) with the same trimming proportion as for computing the sparse 
LTS. In A;-fold cross-validation, the data are split randomly in k blocks of 
approximately equal size. Each block is left out once to fit the model, and 
the left-out block is used as test data. In this manner, and for a given value 
of A, a prediction is obtained for each observation in the sample. Denote the 
vector of squared prediction errors e^ = (ef , . . . , e^)'. Then 



(5.2) RTMSPE(A) 



h 



VU' 



To reduce variability, the RTMSPE may be averaged over a number of dif- 
ferent random splits of the data. 

The selected A then minimizes BIC(A) or RTMSPE(A) over a grid of 
values in the interval [0, Aq]. We take a grid with steps of size 0.025Ao, 
where Aq is an estimate of the shrinkage parameter Aq that would shrink all 
parameters to zero. If p > n, is of course excluded from the grid. For the 
lasso solution we take 

2 
(5.3) Aq = — max Cor(y,x,), 

njg{i,...,p} 

exactly the same as given and motivated in Efron et al. (2004). In (5.3), 
Cor(y,Xj) stands for the Pearson correlation between y and the jth column 
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of the design matrix X. For sparse LTS, we need a robust estimate Aq- We 
propose to replace the Pearson correlation in (5.3) by the robust correlation 
based on bivariate winsorization of the data [see Khan, Van Aelst and Zamar 
(2007)]. 

6. Simulation study. This section presents a simulation study for com- 
paring the performance of various sparse estimators. The simulations are 
performed in R [R Development Core Team (2011)] with package simFrame 
[Alfons, Tempi and Filzmoser (2010), Alfons (2012a)], which is a general 
framework for simulation studies in statistics. Sparse LTS is evaluated for 
the subset size h= [{n + l)0.75j . Both the raw and the reweighted version 
(see Section 4) are considered. We prefer to take a relatively large trimming 
proportion to guarantee a breakdown point of 25%. Adding the reweighting 
step will then increase the statistical efficiency of sparse LTS. We make a 
comparison with the lasso, the LAD-lasso and robust least angle regression 
(RLARS), discussed in the introduction. We selected the LAD-lasso estima- 
tor as a representative of the class of penalized M-estimators, since it does 
not need an initial residual scale estimator. 

For every generated sample, an optimal value of the shrinkage parameter 
A is selected. The penalty parameters for sparse LTS and the lasso are chosen 
using the BIC, as described in Section 5. For the LAD-lasso, we estimate 
the shrinkage parameter in the same way as in Wang, Li and Jiang (2007). 
However, if p > n, we cannot use their approach and use the BIC as in 
(5.1), with the mean absolute value of residuals (multiplied by a consistency 
factor) as scale estimate. For RLARS, we add the sequenced variables to the 
model in a stepwise fashion and fit robust MM-regressions [Yohai (1987)], as 
advocated in Khan, Van Aelst and Zamar (2007). The optimal model when 
using RLARS is then again selected via BIC, now using the robust scale 
estimate resulting from the MM-regression. 

6.1. Sampling schemes. The first configuration is a latent factor model 
taken from Khan, Van Aelst and Zamar (2007) and covers the case oi n> p. 
From k = 6 latent independent standard normal variables li , . . . , 1^ and an 
independent normal error variable e with standard deviation o", the response 
variable y is constructed as 

y:=li + --- + lfc + e, 

where a is chosen so that the signal-to-noise ratio is 3, that is, a = y/k/3. 
With independent standard normal variables ei, . . . , ep, a set of ;? = 50 can- 
didate predictors is then constructed as 

Xfc+1 := li + Se^+i, 
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XA;+2 :=ll +<5efc+2, 



fc-l 


= h + Se3k-i, 


X3fc 


= lfc + <5e3fc, 


Xj 


= ej, j = 



3/c + l,...,p, 

where r = 0.3 and 5 = 5 so that xi, . . . ,Xfc are low- noise perturbations of the 
latent variables, x^+i, . . . ,X3fc are noise covariates that are correlated with 
the latent variables, and xs^+i, . . . ,Xp are independent noise covariates. The 
number of observations is set to n = 150. 

The second configuration covers the case of moderate high-dimensional 
data. We generate n = 100 observations from a p-dimensional normal dis- 
tribution N{0,Ti), with p= 1000. The covariance matrix 51 = {^ij)i<ij<p 
is given by Ejj = 0.5'*"-^', creating correlated predictor variables. Using the 
coefficient vector /3 = (/3j)i<j<p with /3i = /Sy = 1.5, /32 = 0.5, P^ = /3ii = 1, 
and /3j = for j G {1, . . . ,p} \ {1, 2, 4, 7, 11}, the response variable is gener- 
ated according to the regression model (1.1), where the error terms follow a 
normal distribution with a = 0.5. 

Finally, the third configuration represents a more extreme case of high- 
dimensional data with n = 100 observations and p = 20,000 variables. The 
first 1000 predictor variables are generated from a multivariate normal dis- 
tribution A^(0, S) with Sjj =0.6'*"-''. Furthermore, the remaining 19,000 
covariates are standard normal variables. Then the response variable is gen- 
erated according to (1.1), where the coefficient vector /3 = iPj)i<j<p is given 
by f3j = 1 for 1 < j < 10 and 13 j = for 11 < j < p, and the error terms follow 
a standard normal distribution. 

For each of the three simulation settings, we apply contamination schemes 
taken from Khan, Van Aelst and Zamar (2007). To be more precise, we 
consider the following: 

(1) No contamination. 

(2) Vertical outliers: 10% of the error terms in the regression model follow 
a normal A^(20, a) instead of a A^(0,cr). 

(3) Leverage points: Same as in 2, but the 10% contaminated observa- 
tions contain high-leverage values by drawing the predictor variables from 
independent A^(50, 1) distributions. 

In addition, we investigate a fourth and more stressful outlier scenario. Keep- 
ing the contamination level at 10%, outliers in the predictor variables are 
drawn from independent A^(10,0.01) distributions. Note the small standard 
deviation such that the outliers form a dense cluster. Let Xj denote such a 
leverage point. Then the values of the response variable of the contaminated 
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observations are generated by yi = 77x^7 with 7 = (— l/p)i<j<p- The direc- 
tion of 7 is very different from the one of the true regression parameter /3 in 
the following ways. First, 7 is not sparse. Second, all predictors have a neg- 
ative effect on the response in the contaminated observations, whereas the 
variables with nonzero coefficients have a positive effect on the response in 
the good data points. Fm'thermore, the parameter rj controls the magnitude 
of the leverage effect and is varied from 1 to 25 in five equidistant steps. 

This results in a total of 12 different simulations schemes, which we think 
to be representative for the many different simulation designs we tried out. 
The first scheme has n > p, the second setting has p > n, and the third 
setting has p^ n. The choices for the contamination schemes are standard, 
inducing both vertical outliers and leverage points in the samples. 

6.2. Performance measures. Since one of the aims of sparse model es- 
timation is to improve prediction performance, the different estimators are 
evaluated by the root mean squared prediction error (RMSPE). For this 
purpose, n additional observations from the respective sampling schemes 
(without outliers) are generated as test data, and this in each simulation 
run. Then the RMSPE is given by 



RMSPE(/3) 



\ 



-Y,{y:-^'0f 



where y* and x*, i = 1, . . . ,n, denote the observations of the response and 
predictor variables in the test data, respectively. The RMSPE of the or- 
acle estimator, which uses the true coefficient values /3, is computed as a 
benchmark for the evaluated methods. We report average RMSPE over all 
simulation runs. 

Concerning sparsity, the estimated models are evaluated by the false pos- 
itive rate (FPR) and the false negative rate (FNR). A false positive is a 
coefficient that is zero in the true model, but is estimated as nonzero. Anal- 
ogously, a false negative is a coefficient that is nonzero in the true model, 
but is estimated as zero. In mathematical terms, the FPR and FNR are 
defined as 

FPR(^) 



FNR(/3) 



|{j£{l,...,ff}:/3,/OA/3,=0}| 
liiG {!,..., p}:/3,=0}| 

|{jG{l,...,p}:/3,=OA/3,/0}| 



Both FPR and FNR should be as small as possible for a sparse estimator and 
are averaged over all simulation runs. Note that false negatives in general 
have a stronger effect on the RMSPE than false positives. A false negative 
means that important information is not used for prediction, whereas a false 
positive merely adds a bit of variance. 
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Table 1 

Results for the first simulation scheme, with n = 150 and p = 50. Root mean squared 

prediction error (RMSPE), the false positive rate (FPR) and the false negative rate 

(FNR), averaged over 500 simulation runs, are reported for every method 





No contamination 


Vertical outliers 


Leverage points 


Method 


RMSPE 


FPR 


FNR 


RMSPE 


FPR 


FNR 


RMSPE 


FPR FNR 


Lasso 


1.18 


0.10 


0.00 


2.44 


0.54 


0.09 


2.20 


0.00 0.16 


LAD-lasso 


1.13 


0.05 


0.00 


1.15 


0.07 


0.00 


1.27 


0.18 0.00 


RLARS 


1.14 


0.07 


0.00 


1.12 


0.03 


0.00 


1.22 


0.09 0.00 


Raw sparse LTS 


1.29 


0.34 


0.00 


1.26 


0.32 


0.00 


1.26 


0.26 0.00 


Sparse LTS 


1.24 


0.22 


0.00 


1.22 


0.25 


0.00 


1.22 


0.18 0.00 


Oracle 


0.82 






0.82 






0.82 





6.3. Simulation results. In this subsection the simulation results for the 
different data configurations are presented and discussed. 

6.3.1. Results for the first sampling scheme. The simulation results for 
the first data configuration are displayed in Table 1. Keep in mind that this 
configuration is exactly the same as in Khan, Van Aelst and Zamar (2007), 
and that the contamination settings are a subset of the ones applied in 
their paper. In the scenario without contamination, LAD-lasso, RLARS and 
lasso show excellent performance with low RMSPE and FPR. The prediction 
performance of sparse LTS is good, but it has a larger FPR than the other 
three methods. The reweighting step clearly improves the estimates, which 
is reflected in the lower values for RMSPE and FPR. Furthermore, none of 
the methods suffer from false negatives. 

In the case of vertical outliers, the nonrobust lasso is clearly influenced 
by the outliers, reflected in the much higher RMSPE and FPR. RLARS, 
LAD-lasso and sparse LTS, on the other hand, keep their excellent behavior. 
Sparse LTS still has a considerable tendency toward false positives, but the 
reweighting step is a significant improvement over the raw estimator. 

When leverage points are introduced in addition to the vertical outliers, 
the performance of RLARS, sparse LTS and LAD-lasso is comparable. The 
FPR of RLARS and LAD-lasso slightly increased, whereas the FPR of sparse 
LTS slightly decreased. The LAD-lasso still performs well, and even the lasso 
performs better than in the case of only vertical outliers. This suggests that 
the leverage points in this example do not have a bad leverage effect. 

In Figure 1 the results for the fourth contamination setting are shown. 
The RMSPE is thereby plotted as a function of the parameter rj. With 
increasing rj, the RMSPE of the lasso and the LAD-lasso increases. RLARS 
has a considerably higher RMSPE than sparse LTS for lower values of r], but 
the RMSPE gradually decreases with increasing rj. However, the RMSPE of 
sparse LTS remains the lowest, thus, it has the best overall performance. 
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Lasso Raw sparse LTS 








1 1 1 1 1 
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5 - 
LU 

^ 4- 

CC 3 - 

2 - 




1 - 









1 7 13 19 25 

n (controlling the magnitude of the leverage effect) 

Fig. 1. Root mean squared prediction error (RMSPE) for the first simulation scheme, 
with n = 150 and p = 50, and for the fourth contamination setting, averaged over 500 
simulation runs. Lines for raw and reweighted sparse LTS almost coincide. 

6.3.2. Results for the second sampling scheme. Table 2 contains the sim- 
ulation results for the moderate high-dimensional data configuration. In the 
scenario without contamination, RLARS and the lasso perform best with 
very low RMSPE and almost perfect FPR and FNR. Also, the LAD-lasso 
has excellent prediction performance, followed by sparse LTS. The LAD- 
lasso leads to a slightly higher FPR than the other methods, though. When 
vertical outliers are added, RLARS still has excellent prediction performance 
despite some false negatives. We see that the sparse LTS performs best here. 
In addition, the prediction performance of the nonrobust lasso already suffers 
greatly from the vertical outliers. In the scenario with additional leverage 
points, sparse LTS remains stable and is still the best. For RLARS, sparsity 



Table 2 

Results for the second simulation scheme, with n — 100 and p — 1000. Root mean squared 

prediction error (RMSPE), the false positive rate (FPR) and the false negative rate 

(FNR), averaged over 500 simulation runs, are reported for every method 





No contamination 


Vertical outliers 


Leverage points 


Method 


RMSPE 


FPR 


FNR 


RMSPE 


FPR 


FNR 


RMSPE 


FPR FNR 


Lasso 


0.62 


0.00 


0.00 


2.56 


0.08 


0.16 


2.53 


0.00 0.71 


LAD-lasso 


0.66 


0.08 


0.00 


0.82 


0.00 


0.01 


1.17 


0.08 0.01 


RLARS 


0.60 


0.01 


0.00 


0.73 


0.00 


0.10 


0.92 


0.02 0.09 


Raw sparse LTS 


0.81 


0.02 


0.00 


0.73 


0.02 


0.00 


0.73 


0.02 0.00 


Sparse LTS 


0.74 


0.01 


0.00 


0.69 


0.01 


0.00 


0.71 


0.02 0.00 


Oracle 


0.50 






0.50 






0.50 
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Lasso 

LAD-lasso 

RLARS 



Raw sparse LTS 
Sparse LTS 
Oracle 




1 7 13 19 25 

Ti (controlling the magnitude of the leverage effect) 

Fig. 2. Root mean squared prediction error (RMSPE) for the second simulation scheme, 
with n = 100 and p = 1000, and for the fourth contamination setting, averaged over 500 
simulation runs. Lines for raw and reweighted sparse LTS almost coincide. 

behavior according to FPR and FNR does not change significantly either, 
but there is a small increase in the RMSPE. On the other hand, LAD-lasso 
already has a considerably larger RMSPE than sparse LTS, and again a 
slightly higher FPR than the other methods. Furthermore, the lasso is still 
highly influenced by the outliers, which is reflected in a very high FNR and 
poor prediction performance. 

The results for the fourth contamination setting are presented in Fig- 
ure 2. As for the previous simulation scheme, the RMSPE for the lasso and 
the LAD-lasso is increasing with increasing parameter rj. The RMSPE for 
RLARS, however, is gradually decreasing. Sparse LTS shows particularly 
interesting behavior: the RMSPE is close to the oracle at first, then there is 
a kink in the curve (with the value of the RMSPE being in between those 
for the LAD-lasso and the lasso), after which the RMSPE returns to low 
values close to the oracle. In any case, for most of the investigated values of 
rj, sparse LTS has the best performance. 

6.3.3. Results for the third sampling scheme. Table 3 contains the sim- 
ulation results for the more extreme high-dimensional data configuration. 
Note that the LAD-lasso was no longer computationally feasible with such 
a large number of variables. In addition, the number of simulation runs was 
reduced from 500 to 100 to lower the computational effort. 

In the case without contamination, the sparse LTS suffers from an ef- 
ficiency problem, which is reflected in larger values for RMSPE and FNR 
than for the other methods. The lasso and RLARS have considerably better 
performance in this case. With vertical outliers, the RMSPE for the lasso in- 
creases greatly due to many false negatives. Also, RLARS has a larger FNR 
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Table 3 

Results for the third simulation scheme, with n = 100 and p = 20,000. Root mean squared 

prediction error (RMSPE), the false positive rate (FPR) and the false negative rate 

(FNR), averaged over 100 simulation runs, are reported for every method 



No contamination 



Vertical outliers 



Leverage points 



Method 


RMSPE 


FPR 


FNR 


RMSPE 


FPR 


FNR 


RMSPE 


FPR 


FNR 


Lasso 


1.43 


0.000 


0.00 


5.19 


0.004 


0.49 


5.57 


0.000 


0.83 


RLARS 


1.54 


0.001 


0.00 


2.53 


0.000 


0.38 


3.34 


0.001 


0.45 


Raw sparse LTS 


3.00 


0.001 


0.19 


2.59 


0.002 


0.11 


2.59 


0.002 


0.10 


Sparse LTS 


2.88 


0.001 


0.16 


2.49 


0.002 


0.10 


2.57 


0.002 


0.09 


Oracle 


1.00 






1.00 






1.00 







than sparse LTS, resulting in a slightly lower RMSPE for the reweighted ver- 
sion of the latter. When leverage points are introduced, sparse LTS clearly 
exhibits the lowest RMSPE and FNR. Furthermore, the lasso results in a 
very large FNR. 

Figure 3 shows the results for the fourth contamination setting. Most 
interestingly, the RMSPE of RLARS in this case keeps increasing in the 
beginning and even goes above the one of the lasso, before dropping dropping 
continuously in the remaining steps. Sparse LTS again shows a kink in the 
curve for the RMSPE, but clearly performs best. 

6.3.4. Summary of the smulation results. Sparse LTS shows the best 
overall performance in this simulation study, if the reweighted version is 
taken. Concerning the other investigated methods, RLARS also performs 



Lasso 
RLARS 



Raw sparse LTS 
Sparse LTS 
Oracle 



m 




r| (controlling the magnitude of the leverage effect) 



Fig. 3. Root mean squared prediction error (RMSPE) for the third simulation scheme, 
with n = 100 and p — 20,000, and for the fourth contamination setting, averaged over 100 
simulation runs. Lines for raw and reweighted sparse LTS almost coincide. 
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well, but suffers sometimes from an increased percentage of false negatives 
under contamination. It is also confirmed that the lasso is not robust to 
outliers. The LAD-lasso still sustains vertical outliers, but is not robust 
against bad leverage points. 

7. NCI-60 cancer cell panel. In this section the sparse LTS estimator 
is compared to the competing methods in an application to the cancer cell 
panel of the National Cancer Institute. It consists of data on 60 human 
cancer cell lines and can be downloaded via the web application CellMiner 
(http://discover.nci.nih.gov/cellininer/). We regress protein expres- 
sion on gene expression data. The gene expression data were obtained with 
an Affymetrix HG-U133A chip and normalized with the GCRMA method, 
resulting in a set of p = 22,283 predictors. The protein expressions based 
on 162 antibodies were acquired via reverse-phase protein lysate arrays and 
log2 transformed. One observation had to be removed since all values were 
missing in the gene expression data, reducing the number of observations 
to n = 59. More details on how the data were obtained can be found in 
Shankavaram et al. (2007). Furthermore, Lee et al. (2011) also use this data 
for regression analysis, but consider only nonrobust methods. They obtain 
models that still consist of several hundred to several thousand predictors 
and are thus difficult to interpret. 

Similar to Lee et al. (2011), we first order the protein expression variables 
according to their scale, but use the MAD (median absolute deviation from 
the median, multiplied with the consistency factor 1.4826) as a scale estima- 
tor instead of the standard deviation. We show the results for the protein 
expressions based on the KRT18 antibody, which constitutes the variable 
with the largest MAD, serving as one dependent variable. Hence, our re- 
sponse variable measures the expression levels of the protein keratin 18, 
which is known to be persistently expressed in carcinomas [Oshima, Barib- 
ault and Caulin (1996)]. We compare raw and reweighted sparse LTS with 
25% trimming, lasso and RLARS. As in the simulation study, the LAD-lasso 
could not be computed for such a large p. The optimal models are selected 
via BIC as discussed in Section 5. The raw sparse LTS estimator thereby 
results in a model with 32 genes. In the reweighting step, one more obser- 
vation is added to the best subset found by the raw estimator, yielding a 
model with 33 genes for reweighted sparse LTS (thus also one more gene 
is selected compared to the raw estimator). The lasso model is somewhat 
larger with 52 genes, whereas the RLARS model is somewhat smaller with 
18 genes. 

Sparse LTS and the lasso have three selected genes in common, one of 
which is KRT8. The product of this gene, the protein keratin 8, typically 
forms an intermediate filament with keratin 18 such that their expression 
levels are closely linked [e.g., Owens and Lane (2003)]. However, the larger 



18 A. ALFONS, C. CROUX AND S. GELPER 

Table 4 

Root trimmed m,ean squared prediction error 

(RTMSPE) for protein expressions based on the 

KRT18 antibody (NCI-60 cancer cell panel data), 

computed from leave-one-out cross-validation 

Method RTMSPE 

Lasso 1.058 

RLARS 0.936 

Raw sparse LTS 0.727 

Sparse LTS 0.721 



model of the lasso is much more difficult to interpret. Two of the genes 
selected by the lasso are not even recorded in the Gene database [Maglott 
et al. (2005)] of the National Center for Biotechnology Information (NCBI). 
The sparse LTS model is considerably smaller and easier to interpret. For 
instance, the gene expression level of MSLN, whose product mesothelin is 
overexpressed in various forms of cancer [Hassan, Bera and Pastan (2004)], 
has a positive effect on the protein expression level of keratin 18. 

Concerning prediction performance, the root trimmed mean squared pre- 
diction error (RTMSPE) is computed as in (5.2) via leave-one-out cross- 
validation (so k = n). Table 4 reports the RTMSPE for the considered meth- 
ods. Sparse LTS clearly shows the smallest RTMSPE, followed by RLARS 
and the lasso. In addition, sparse LTS detects 13 observations as outliers, 
showing the need for a robust procedure. Further analysis revealed that 
including those 13 observations changes the correlation structure of the pre- 
dictor variables with the response. Consequently, the order in which the 
genes are added to the model by the lasso algorithm on the full sample is 
completely different from the order on the best subset found by sparse LTS. 
Leaving out those 13 observations therefore yields more reliable results for 
the majority of the cancer cell lines. 

It is also worth noting that the models still contain a rather large number 
of variables given the small number of observations. For the lasso, it is well 
known that it tends to select many noise variables in high dimensions since 
the same penalty is applied on all variables. Meinshausen (2007) therefore 
proposed a relaxation of the penalty for the selected variables of an initial 
lasso fit. Adding such a relaxation step to the sparse LTS procedure may 
thus be beneficial for large p and is considered for future work. 

8. Computational details and CPU times. All computations are car- 
ried out in R version 2.14.0 [R Development Core Team (2011)] using the 
packages rohustHD [Alfons (2012b)] for sparse LTS and RLARS, quantreg 
[Koenker (2011)] for the LAD-lasso and lars [Hastie and Efron (2011)] for 
the lasso. Most of sparse LTS is thereby implemented in C++, while RLARS 



SPARSE LEAST TRIMMED SQUARES REGRESSION 



19 



Lasso 
LAD-lasso 



RLARS 
Sparse LTS 




10'^2.0 



10'^2.5 



lO'^S.O 



10'^3.5 



10M.0 



Fig. 4. CPU times (in seconds) for n— 100 and varying p, averaged over 10 runs. 



is an optimized version of the R code by Khan, Van Aelst and Zamar (2007). 
Optimization of the RLARS code was necessary since the original code builds 
a, p X p matrix of robust correlations, which is not computationally feasible 
for very large p. The optimized version only stores an q x p matrix, where q 
is the number of sequenced variables. Furthermore, the robust correlations 
are computed with C++ rather than R. 

Since computation time is an important practical consideration, Figure 4 
displays computation times of lasso, LAD-lasso, RLARS and sparse LTS in 
seconds. Note that those are average times over 10 runs based on simulated 
data with n = 100 and varying dimension p, obtained on an Intel Xeon X5670 
machine. For sparse LTS and the LAD-lasso, the reported CPU times are 
averages over a grid of five values for A. RLARS is a hybrid procedure, thus, 
we only report the CPU times for obtaining the sequence of predictors, but 
not for fitting the models along the sequence. 

As expected, the computation time of the nonrobust lasso remains very 
low for increasing p. Sparse LTS is still reasonably fast up to p ~ 10,000, 
but computation time is a considerable factor if p is much larger than that. 
However, sparse LTS remains faster than obtaining the RLARS sequence. 
A further advantage of the subsampling algorithm of sparse LTS is that it 
can easily be parallelized to reduce computation time on modern multicore 
computers, which is future work. 

9. Conclusions and discussion. Least trimmed squares (LTS) is a robust 
regression method frequently used in practice. Nevertheless, it does not al- 
low for sparse model estimates and cannot be applied to high-dimensional 
data with p > n. This paper introduced the sparse LTS estimator, which 
overcomes these two issues simultaneously by adding an Li penalty to the 
LTS objective function. Simulation results and a real data application to 
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protein and gene expression data of the NCI-60 cancer cell panel illustrated 
the excellent performance of sparse LTS and showed that it performs as 
well or better than robust variable selection methods such as RLARS. In 
addition, an advantage of sparse LTS over algorithmic procedures such as 
RLARS is that the objective function allows for theoretical investigation of 
its statistical properties. As such, we could derive the breakdown point of 
the sparse LTS estimator. However, it should be noted that efficiency is an 
issue with sparse LTS. A reweighting step can thereby lead to a substantial 
improvement in efficiency, as shown in the simulation study. 

In the paper, an Li penalization was imposed on the regression parameter, 
as for the lasso. Other choices for the penalty are possible. For example, an 
L2 penalty leads to ridge regression. A robust version of ridge regression was 
recently proposed by Maronna (2011), using L2 penalized MM-estimators. 
Even though the resulting estimates are not sparse, prediction accuracy is 
improved by shrinking the coefficients, and the computational issues with 
high-dimensional robust estimators are overcome due to the regularization. 
Another possible choice for the penalty function is the smoothly clipped 
absolute deviation penalty (SCAD) proposed by Fan and Li (2001). It sat- 
isfies the mathematical conditions for sparsity but results in a more difficult 
optimization problem than the lasso. Still, a robust version of SCAD can 
be obtained by optimizing the associated objective function over trimmed 
samples instead of over the full sample. 

There are several other open questions that we leave for future research. 
For instance, we did not provide any asymptotics for sparse LTS, as was, for 
example, done for penalized M-estimators in Germain and Roueff (2010). 
Potentially, sparse LTS could be used as an initial estimator for computing 
penalized M-estimators. 

All in all, the results presented in this paper suggest that sparse LTS is 
a valuable addition to the statistics researcher's toolbox. The sparse LTS 
estimator has an intuitively appealing definition and is related to the popu- 
lar least trimmed squares estimator of robust regression. It performs model 
selection, outlier detection and robust estimation simultaneously, and is ap- 
plicable if the dimension is larger than the sample size. 

APPENDIX: PROOF OF BREAKDOWN POINT 

Proof of Theorem 1. In this proof the Li norm of a vector /3 is 
denoted as ||/3||i and the Euclidean norm as ||/9||2- Since these norms are 
topologically equivalent, there exists a constant ci > such that ||/9||i > 
C1II/3II2 for all vectors j3. The proof is split into two parts. 

First, we prove that e*(^; Z) > "~ "'"^ . Replace the last m <n — h observa- 
tions, resulting in the contaminated sample Z. Then there are still n — m>h 
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good observations in Z. Let My = niaxi<j<„ \yi\ and M^^ = maxi<j<„ |xii|. 
For the case /3j = 0, j = 1, . . . ,p, the vahie of the objective function is given 
by 

h h 

Q(o) = Y.^p{y%.n < E(/^(y))- ^ MM,). 

Now consider any /3 with ||/3||2 > M := {hp{My) + l)/(Aci). For the value of 
the objective function, it holds that 

Q(/3) > A||/3||i > XciWPh > hpiMy) + 1 > g(0). 

Since Q0) < Q{0), we conclude that ||,9(Z)||2 < M, where M does not 
depend on the outliers. This concludes the first part of the proof. 

Second, we prove that e*{(3; Z) < ""^"'"^ . Move the last m = n — h+1 ob- 
servations of Z to the position z(7, r) = (x(r)', 1/(7, r))' = ((r, 0, . . . , 0), 7r)' 
with 7,r > 0, and denote Z^^,- the resulting contaminated sample. Assume 
that there exists a constant M such that 

(A.l) supp(Z^,0ll2<M, 

that is, there is no breakdown. We will show that this leads to a contradic- 
tion. 

Let /3^ = (7, 0, . . . , 0)' G W with 7 = M + 2 and define r > such that 
p(r) > max(/i — m,0)p{My + 7M2;j) + /iA7 + 1. Note that r is always well 
defined due to the assumptions on p, in particular, since p{oo) = 00. Then 
the objective function is given by 

h—m 
i=l 

, /1AI7I, else, 

since the residuals with respect to the outliers are all zero. Hence, 
(A.2) Q{p^) < max(/i - m,0)p{My + -fM^,) + hX-f < p{t) - 1. 
Furthermore, for /3 = (/3i, . . . , /3p)' with \\(3\\2 < 7 — 1 we have 

Q{(3)>p{jT-T(3i), 

since at least one outlier will be in the set of the smallest h residuals. Now 

/3i<||/9||2<7-l, so that 

(A.3) Q(/3)>p(t(7-/3i))>/9(t), 

since p is nondecreasing. 

Combining (A.2) and (A.3) leads to 

||^(Z^,r)|l2>7-l=M + l, 

which contradicts the assumption (A.l). Hence, there is breakdown. D 
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